ABSTRACT

We describe HARC, a system for speech understanding
that integrates speech recognition techniques with
natural  language  processing.  The  integrated  system 
uses  statistical  pattern  recognition  to  build  a  lattice  of 
potential  words  in  the  input  speech.  This  word  lattice 
is  passed  to  a  unification  parser  to  derive  all  possible 
associated  syntactic  structures  for  these  words.  The 
resulting  parse  structures  are  passed  to  a  multi-level 
semantics  component  for  interpretation. 

INTRODUCTION 

HARC, the BBN Spoken Language System (Boisen,
et al. (1989)), is a system for speech understanding
that integrates speech recognition techniques with
natural language processing. As our integration
methodology, we use lattice parsing. In this architecture,
an acoustic processor produces a lattice of possible
words that is passed to a parser, which produces
all possible parses for all syntactically permissible
word sequences present in the lattice. These parse
trees are then passed to a semantic interpretation
component, which produces the possible interpretations
of these parse structures, filtering out anomalous
readings where possible.

THE  ARCHITECTURE  OF  HARC 

In  this  section,  we  present  a  more  detailed  outline  of 
the  general  architecture  of  HARC: 

1. An acoustic processor, which uses context-dependent
Hidden Markov Models (HMMs)
for acoustic modelling, produces a lattice of
possible words in the input speech.

2. A chart parser uses a unification grammar to
parse the word lattice and produces the set
of all possible parses for all syntactically
permissible word sequences in the lattice. The
resulting parses are ranked by acoustic
likelihood score.

3. A multi-level semantics component
processes the parse trees.^1 This component
uses 4 translation steps to derive the meaning
of each parse.

^1 This architecture and the names of the associated language levels
are from the PHLIQA1 system (Bronnenberg et al. 1980).

a. The parse tree is converted to an expression
of EFL (English-oriented Formal
Language); at this level, each word has
one EFL constant; this includes words
with multiple senses.

b. Each EFL expression is translated into
one or more expressions of WML (World
Model Language). Where possible, ambiguous
constants from an EFL expression
are disambiguated and logically
equivalent EFL expressions are collapsed.

c. Each WML expression is converted to an
expression in DBL (Data Base
Language), which contains one constant
for each file in the data base.

d. The value of each DBL expression is
computed by evaluating the expression
against the database; this value is expressed
in CVL (Canonical Value
Language).

For speech understanding, semantics identifies the
highest scoring "meaningful" sentence. This sentence
is the recognized spoken utterance and its
meaning is the system's interpretation of the input.
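
The selection step described above can be sketched as follows. This is
only an illustration under stated assumptions, not the HARC
implementation: the function name `pick_meaningful`, the toy scores, and
the toy "semantics" table are all hypothetical, and an empty list of
readings stands in for a sentence that semantics rejects as anomalous.

```python
from typing import Callable, Iterable, Optional, Sequence


def pick_meaningful(
    hypotheses: Iterable[tuple[str, float]],
    interpret: Callable[[str], Sequence[str]],
) -> Optional[tuple[str, Sequence[str]]]:
    """Return the best-scoring hypothesis that has at least one reading."""
    ranked = sorted(hypotheses, key=lambda h: h[1], reverse=True)
    for sentence, _score in ranked:
        readings = interpret(sentence)
        if readings:  # an empty result means "no meaningful reading"
            return sentence, readings
    return None


# Hypothetical stand-in for the semantics component.
toy_semantics = {
    "show the ships in port": ["(display (ships-in port))"],
}
# Hypothetical acoustic log scores: higher is better.
hyps = [
    ("show the chips in port", -10.2),
    ("show the ships in port", -10.9),
]
best = pick_meaningful(hyps, lambda s: toy_semantics.get(s, []))
print(best)
```

In this toy run the acoustically preferred string has no reading, so the
second-ranked sentence is returned together with its interpretation.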

TRAINING  AND  TEST  SETS 

To measure the coverage of the syntactic and semantic
components and the speech understanding performance
of the integrated system, we use the DARPA
1000-word Resource Management Database corpus.
This corpus is divided into two sets of sentences:^2 a
training corpus of 791 sentences and a test corpus of
200 sentences. Syntax and semantics development
work is done on the basis of the training corpus; the
test corpus is kept hidden from the system
developers, to simulate novel utterances that users of
the system might make. Periodically, the test corpus
is run through the system, again without the
developers looking at any of the sentences. However,
statistics are collected on the percentage of sentences
successfully processed; this number can be
compared to the percentage of the training corpus
processed, to see how well the system generalizes
from the training (known) to test (unknown) corpus.
Subsequent sections of this paper present coverage
results on both training and test sets for each
component of the system.

^2 The DARPA database has well-defined training and test sets for
the speech data. However, for natural language development work,
there is no such well-defined division. For the purpose of
evaluating natural language work, we defined at BBN a training
corpus of 791 sentences, based on 791 patterns, and a test corpus of
200 sentences, based on an independently selected set of 200
patterns. We feel these two corpora are a reasonable interim solution
for the language modelling problem in the DARPA Resource
Management domain.

THE  ACOUSTIC  PROCESSOR 

Since the speech understanding search in our system
is decoupled into two phases (acoustic scoring and
language model scoring), an efficient overall search
requires that sufficient computing is performed in the
first stage and that enough information is saved so
that optimality is preserved in the later stages of
processing.

To this end, our lattice computation algorithm
attempts, in principle, to compute acoustic likelihood
scores for all words in the vocabulary V for all time
intervals from t_i to t_j. The acoustic data is typically a
sequence of analyzed and vector-quantized (VQ) input
spectra sampled every 10 milliseconds (Chow, et
al. 1987). We model the input speech at the phonetic
level using robust context-dependent HMMs of the
phoneme. The acoustic model for each word in the
vocabulary is then derived from the concatenation of
these phonetic HMMs. Using these acoustic models
of the word, one can compute the acoustic scores for
all words in the input utterance using a
time-synchronous dynamic time warping (DTW) procedure
with beam pruning.

An integral part of the task of the acoustic processor
is to produce a word lattice that can be processed
efficiently by the lattice parser. To do this, we reduce
the lattice size through various lattice pruning
techniques. We have used three pruning techniques,
which we describe here briefly. (For full details, see
Boisen, et al. (1989).)

Score  Thresholding: 

Word hypotheses are pruned on the basis of the
unit score: the hypothesis' acoustic score normalized
by its duration. The goal is to keep only those
acoustic theories with a unit score greater than
some predetermined threshold, and eliminate all
others. In practice, we found it nearly impossible
to find a single threshold that works for all words
and have adopted a strategy that uses dual
thresholds: one for short function words and
another for longer, multi-syllabic words.

Subsumption Pruning:

Subsumption pruning is designed to deal explicitly
with the problem of short function words,
which are acoustically unreliable and which are
often found throughout the speech signal, even
within longer words. Since it is almost always
the case that short words match parts of long
words, not vice versa, word theories that are
found completely inside another word theory, with
unit score below some factor p of the parent
theory, are eliminated from the word lattice.

Forward-Backward Pruning:

Forward-backward pruning is based on the
familiar forward-backward algorithm for estimating
the parameters of HMMs; it requires that all
acoustic theories be part of a complete path
through the lattice and, furthermore, score
reasonably well.
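
As an illustration of the first two techniques, the sketch below applies
dual-threshold pruning and then subsumption pruning to a handful of word
hypotheses. It is not the BBN code: the `WordHyp` class, the threshold
values, and the factor `p` are hypothetical, and the scores are made-up
positive numbers where larger is better (with log-likelihoods the
comparison against a factor of the parent's score would need adapting).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class WordHyp:
    word: str
    start: int    # first frame
    end: int      # one past the last frame
    score: float  # toy acoustic score, larger is better

    @property
    def unit_score(self) -> float:
        # Acoustic score normalized by duration.
        return self.score / (self.end - self.start)


def threshold_prune(hyps, short_words, short_thresh, long_thresh):
    """Keep hypotheses whose unit score exceeds the applicable threshold."""
    kept = []
    for h in hyps:
        limit = short_thresh if h.word in short_words else long_thresh
        if h.unit_score > limit:
            kept.append(h)
    return kept


def subsumption_prune(hyps, p):
    """Drop a hypothesis lying inside a longer one when its unit score
    is below p times the unit score of that longer (parent) hypothesis."""
    kept = []
    for child in hyps:
        subsumed = any(
            parent.start <= child.start
            and child.end <= parent.end
            and (parent.end - parent.start) > (child.end - child.start)
            and child.unit_score < p * parent.unit_score
            for parent in hyps
        )
        if not subsumed:
            kept.append(child)
    return kept


hyps = [
    WordHyp("carrier", 0, 50, 40.0),  # unit score 0.8
    WordHyp("a", 10, 15, 2.0),        # unit score 0.4, inside "carrier"
    WordHyp("the", 50, 60, 9.0),      # unit score 0.9
    WordHyp("of", 60, 70, 3.0),       # unit score 0.3
]
after_threshold = threshold_prune(hyps, {"a", "the", "of"}, 0.35, 0.5)
after_subsumption = subsumption_prune(after_threshold, p=0.75)
print([h.word for h in after_subsumption])
```

Here "of" falls below the short-word threshold, and "a" survives
thresholding but is removed because it lies inside "carrier" with a much
lower unit score.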

Rather than determining the optimal pruning
technique and using it alone, the system uses these
techniques in tandem to try to produce the optimal word
lattice in terms of size and information content.
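
The requirement that every surviving word theory lie on a complete,
reasonably scoring path can be sketched with a best-path (max) variant of
the forward and backward passes. This is a simplification chosen for
illustration, not necessarily the formulation used in HARC; the function
name, the `beam` parameter, and the toy log scores are assumptions.

```python
import math


def forward_backward_prune(edges, n_frames, beam):
    """Keep edges on a complete path scoring within `beam` of the best.

    edges: list of (word, start, end, log_score) with
           0 <= start < end <= n_frames.
    """
    neg_inf = -math.inf
    fwd = [neg_inf] * (n_frames + 1)  # best path score from frame 0 to t
    bwd = [neg_inf] * (n_frames + 1)  # best path score from t to the end
    fwd[0] = 0.0
    bwd[n_frames] = 0.0

    # Forward pass: edges ordered by start time.
    for _w, s, e, sc in sorted(edges, key=lambda x: x[1]):
        fwd[e] = max(fwd[e], fwd[s] + sc)
    # Backward pass: edges ordered by decreasing end time.
    for _w, s, e, sc in sorted(edges, key=lambda x: x[2], reverse=True):
        bwd[s] = max(bwd[s], sc + bwd[e])

    best = fwd[n_frames]
    if best == neg_inf:  # no complete path through the lattice
        return []
    return [
        edge for edge in edges
        if fwd[edge[1]] + edge[3] + bwd[edge[2]] >= best - beam
    ]


edges = [
    ("show", 0, 30, -30.0),
    ("the", 30, 40, -12.0),
    ("a", 35, 40, -20.0),      # nothing ends at frame 35: dangling
    ("ships", 40, 80, -35.0),
    ("chips", 40, 80, -38.0),
    ("sheep", 40, 80, -60.0),  # on a complete path, but scores poorly
]
kept = forward_backward_prune(edges, n_frames=80, beam=5.0)
print([w for w, *_ in kept])
```

The dangling edge is removed because it is on no complete path, and
"sheep" is removed because its best complete path falls outside the beam.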

THE  SYNTACTIC  COMPONENT 

THE  GRAMMAR  FORMALISM 

HARC uses a grammar formalism based on annotated
phrase structure rules; this formalism is called
the BBN ACFG (for Annotated Context Free
Grammar). While it is in the general tradition of
augmented phrase structure grammars, its immediate
inspiration is Definite Clause Grammars (DCGs)
(Pereira & Warren (1980)). In such grammars, rules
are made up of elements that are not atomic
categories but, rather, are complex symbols consisting
of a category label and feature specifications.
Features (also called arguments) may be either
constants, indicated by lists in the BBN ACFG, or
variables, indicated by symbols with a leading colon.
Identity of variables in the different elements of a rule
is used to enforce agreement in the feature indicated
by the variable. An example is (the features to be
discussed are :AGR, :NPTYPE, and :MOOD):

(S (OCOMP) :MOOD (WH-) ...) →
    (NP :NSUBCATFRAME :AGR :NPTYPE ...)
    (VP :AGR :NPTYPE :MOOD ...)
    (OPTSADJUNCT :AGR ...)

where the variable :AGR enforces agreement between
the VP (ultimately, its head V) and the subject
NP; :NPTYPE, agreement between the syntactic type
of the subject NP and that selected by the head V of
the VP; and :MOOD, agreement between the mood of
the S and that of the VP.
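
The way shared variables enforce agreement can be illustrated with a
small matcher. It is a deliberately reduced sketch, not the ACFG
unifier: constituents are assumed to carry only constant feature values,
constants are written as bare strings rather than lists, the argument
lists are shortened, and the feature values (`3SG`, `REAL`, and so on)
are invented for the example.

```python
def is_variable(arg):
    """Variables are symbols with a leading colon, as in the ACFG."""
    return arg.startswith(":")


def match_rule(rhs_patterns, constituents):
    """Bind rule variables against the features of found constituents.

    Returns the variable bindings, or None if a category, a constant,
    or an agreement constraint fails.
    """
    if len(rhs_patterns) != len(constituents):
        return None
    bindings = {}
    for (cat, args), (found_cat, values) in zip(rhs_patterns, constituents):
        if cat != found_cat or len(args) != len(values):
            return None
        for arg, value in zip(args, values):
            if is_variable(arg):
                # The first use binds the variable; later uses must agree.
                if bindings.setdefault(arg, value) != value:
                    return None
            elif arg != value:
                return None
    return bindings


# Shortened right-hand side of the S rule (hypothetical argument lists).
rhs = [
    ("NP", (":AGR", ":NPTYPE")),
    ("VP", (":AGR", ":NPTYPE", ":MOOD")),
]
agreeing = [
    ("NP", ("3SG", "REAL")),
    ("VP", ("3SG", "REAL", "INDICATIVE")),
]
clashing = [
    ("NP", ("3SG", "REAL")),
    ("VP", ("3PL", "REAL", "INDICATIVE")),
]
print(match_rule(rhs, agreeing))
print(match_rule(rhs, clashing))
```

The second call fails because :AGR is bound to one value by the NP and
meets a different value in the VP.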
In the BBN ACFG, as in DCGs, each grammatical
category has a fixed number of obligatory, positional
arguments. The essential difference between our
formalism and DCGs is a syntactic typing system,
whereby each argument position is limited to a fixed
number of values. We have found this restriction,
in conjunction with the obligatory and positional nature
of arguments, to be of great assistance in developing
a large grammar (currently over 800 rules). Because
we eschew more sophisticated mechanisms such as
feature disjunction, feature negation, metarules, optional
arguments, and the use of attribute-value pairs (as
are found in other complex feature based grammars,
such as GPSG (Gazdar, et al. (1985)), LFG (Bresnan
(1982)), and PATR-II (Shieber, et al. (1983))), it is
relatively straightforward to have a simple syntactic
checker that ensures that all grammar rules are
well-formed. In a grammar as large as the BBN ACFG,
having the ability to automatically make sure that all
rules are well-formed is no small advantage. We
have so far found no need for most of the advanced
facets of other complex feature based formalisms,
with the possible exception of disjunction, which will
probably be added in a restricted form.
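
A checker of the kind described can be very small when every category
has a fixed number of positional arguments and every position has a
fixed set of values. The sketch below assumes a hypothetical declaration
table `DECLS` and invented feature values; it is meant only to show why
the restrictions make well-formedness easy to verify.

```python
# Hypothetical declarations: category -> allowed values per position.
DECLS = {
    "S": [{"INDICATIVE", "IMPERATIVE"}],
    "NP": [{"SG", "PL"}, {"REAL", "EXPLETIVE"}],
    "VP": [{"SG", "PL"}, {"REAL", "EXPLETIVE"},
           {"INDICATIVE", "IMPERATIVE"}],
}


def check_term(term, decls):
    """Return a list of error messages for one (category, args) term."""
    cat, args = term
    if cat not in decls:
        return [f"unknown category {cat}"]
    slots = decls[cat]
    if len(args) != len(slots):
        return [f"{cat}: expected {len(slots)} arguments, got {len(args)}"]
    errors = []
    for pos, (arg, allowed) in enumerate(zip(args, slots), start=1):
        if arg.startswith(":"):  # variables may appear in any position
            continue
        if arg not in allowed:
            errors.append(f"{cat}: value {arg} not allowed in position {pos}")
    return errors


def check_rule(lhs, rhs, decls):
    """Check the left-hand side and every right-hand side term."""
    errors = []
    for term in (lhs, *rhs):
        errors.extend(check_term(term, decls))
    return errors


good = check_rule(
    ("S", (":MOOD",)),
    [("NP", (":AGR", ":NPTYPE")), ("VP", (":AGR", ":NPTYPE", ":MOOD"))],
    DECLS,
)
bad = check_rule(
    ("S", (":MOOD",)),
    [("NP", ("DUAL", ":NPTYPE")), ("VP", (":AGR", ":MOOD"))],
    DECLS,
)
print(good)
print(bad)
```

The malformed rule is reported twice: once for a value outside the
declared set and once for a missing argument.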

An additional difference between our work and standard
DCGs is a depth-boundedness restriction, which
is discussed in the next section.

THE  PARSING  ALGORITHM 

The BBN Spoken Language System uses a parsing
algorithm which is essentially that of Graham, Harrison,
and Ruzzo (1980), henceforth GHR. This algorithm,
in turn, is based on the familiar Cocke-Kasami-Younger
(CKY) algorithm for context-free grammars.
The CKY algorithm is quite simple and powerful: it
starts with the terminal elements in a sentence,
builds successively larger constituents that contain
those already found, and constructs all possible
parses of the input. However, while the CKY algorithm
requires that each rule introducing non-terminal
symbols (essentially the parts of speech, as opposed
to the terminal symbols, i.e. lexical items and
grammatical formatives) be in Chomsky Normal
Form (i.e. of the form A → B C, with exactly two
non-terminal symbols on the right hand side), the
GHR algorithm uses several mechanisms, including
tables and "dotted rules", to get around this restriction.
Since the GHR algorithm, like the CKY algorithm,
deals with context-free grammars, rather than
context-free grammars annotated with features, the
use of the required feature substitution
mechanism, unification, is an extension to the
GHR algorithm; see Haas (1987) for full details.
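
For reference, the plain CKY scheme that GHR builds on can be sketched
as a chart recognizer for a grammar in Chomsky Normal Form. This is the
textbook algorithm, without the GHR tables, dotted rules, or unification;
the toy lexicon and rules are hypothetical.

```python
from collections import defaultdict


def cky_chart(words, lexicon, binary_rules):
    """Fill a CKY chart.

    lexicon:      word -> set of categories
    binary_rules: (B, C) -> set of A, one entry per rule A -> B C
    Returns a dict mapping (i, j) to the categories spanning words[i:j].
    """
    n = len(words)
    chart = defaultdict(set)
    for i, word in enumerate(words):
        chart[i, i + 1] |= lexicon.get(word, set())
    for width in range(2, n + 1):           # successively larger spans
        for i in range(0, n - width + 1):
            j = i + width
            for k in range(i + 1, j):       # split point
                for b in chart[i, k]:
                    for c in chart[k, j]:
                        chart[i, j] |= binary_rules.get((b, c), set())
    return chart


lexicon = {"the": {"Det"}, "ships": {"N"}, "sailed": {"VP"}}
binary_rules = {("Det", "N"): {"NP"}, ("NP", "VP"): {"S"}}

chart = cky_chart(["the", "ships", "sailed"], lexicon, binary_rules)
print(chart[0, 2], chart[0, 3])
```

A sentence is accepted when the start category appears in the cell that
spans the whole input, here `chart[0, 3]`.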

One useful result of our work on extending the GHR
algorithm to handle annotated context free grammars
(ACFGs) is the discovery that there is a class of
ACFGs, depth-bounded ACFGs, for which the parsing
algorithm is guaranteed to find all parses and halt
(Haas (1989)). Depth-bounded ACFGs are characterized
by the property that the depth of a parse tree
cannot grow unboundedly large unless the length of
the string also increases. In effect, such grammars do
not permit rules in which a category derives only itself
and no other children; such rules do not seem to be
needed for the analysis of natural languages, so
computational tractability is maintained without sacrificing
linguistic coverage. The fact that the parsing algorithm
for this class of ACFGs halts is a useful result,
since parsers for complex feature based grammars
cannot be guaranteed to halt, in the general case. By
restricting our grammars to those that satisfy
depth-boundedness, we can be sure that we can parse input
utterances bottom-up and find all parses without the
parser going into an infinite loop.

CONSTRAINING  SYNTACTIC  AMBIGUITY 

Since the BBN ACFG parser finds all the parses for a
given input, there is a potential problem regarding the
number of parses that are found for each input
utterance. Our experience has been that while the
average number of parses per sentence is usually
quite reasonable (about 2), in cases of conjunction or
ellipsis the number of parses can grow wildly. In order
to obtain broad coverage without explosive ambiguity,
we have experimented with a version of the
parser in which rules are sorted into different levels of
grammaticality. In this version of the parser, parses
are ranked according to the rules utilized. Initial
efforts, in which ranks were assigned to rules by hand,
are encouraging. A version of the grammar which
included rules such as determiner ellipsis that
increased ambiguity had an average of 18 parses per
sentence and a mode of 2 parses. However, when
only first order parses were considered the average
was 2.86 parses and the mode was 1. The parser
without the extra rules and without ranking has an
average of 3.95 parses and a mode of 1.

We have also experimented with utilizing statistical
methods of assigning probabilities to rules, based on
the frequency of occurrence of the rules of the grammar
in the training corpus. When we tested the results of this
automatic assignment against a corpus of 48 sentences
(from the training corpus) that were manually
parsed, the parse assigned the top score by the
probabilistic parser was correct 77% of the time.
Looking only at the top 6 parses, the correct parse
was present 96% of the time. Since the success rate
is 96% considering only the top 6 parses, while 50%
of the sentences have 6 or more parses, this suggests
that this probabilistic approach is on the right track.
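
One common way to turn rule frequencies into parse scores is sketched
below. The details are assumptions rather than a description of the BBN
experiment: rule probabilities are estimated by relative frequency among
rules with the same left-hand side, a parse is scored by the sum of the
log probabilities of the rules it uses, and unseen rules receive a small
floor value. The tiny treebank is invented.

```python
import math
from collections import Counter


def estimate_rule_probs(treebank):
    """treebank: list of parses, each a list of (lhs, rhs) rule uses.

    Returns P(rule | lhs) estimated by relative frequency.
    """
    rule_counts = Counter(rule for parse in treebank for rule in parse)
    lhs_counts = Counter()
    for (lhs, _rhs), count in rule_counts.items():
        lhs_counts[lhs] += count
    return {
        rule: count / lhs_counts[rule[0]]
        for rule, count in rule_counts.items()
    }


def parse_log_prob(parse, probs, floor=1e-6):
    """Sum of log rule probabilities; unseen rules get `floor`."""
    return sum(math.log(probs.get(rule, floor)) for rule in parse)


def rank_parses(parses, probs):
    """Most probable parse first."""
    return sorted(parses, key=lambda p: parse_log_prob(p, probs), reverse=True)


S = ("S", ("NP", "VP"))
NP_DET = ("NP", ("Det", "N"))
NP_BARE = ("NP", ("N",))  # e.g. a determiner-ellipsis rule

treebank = [[S, NP_DET], [S, NP_DET], [S, NP_BARE]]
probs = estimate_rule_probs(treebank)

candidates = [[S, NP_BARE], [S, NP_DET]]
ranked = rank_parses(candidates, probs)
print(probs[NP_DET], ranked[0])
```

With scores of this kind, only the top few parses per sentence need to
be passed on to later stages.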
SYNTACTIC COVERAGE

The current ACFG grammar contains 866 rules; of
these, 424 introduce grammatical formatives (such as
the articles "a" and "the", prepositions, etc.). The
remaining rules handle the general syntactic constructions of
English. Coverage on the training corpus is currently
91% and coverage of the test corpus is 81% with this
grammar. The version of the grammar used by the
parser that utilizes ranked rules contains 873 rules.
Coverage with this version of the grammar is 94% on
training and 88% on test.

THE  SEMANTIC  COMPONENT 

As a previous section noted, the semantic processing
of an input query takes place in several stages. First,
the output of the parser, the parse tree, is passed to
the structural semantic module. This produces an
expression of the logical language EFL, which may be
ambiguous. The second stage of processing accepts
as input an expression of EFL and returns as output
zero or more expressions of the logical language
WML. The EFL translation is concerned with structural
semantics: in other words, just the effect of
syntactic structure on meaning. The WML translation
is concerned with lexical semantics: the meaning (in
a given domain) of particular words.

The third step converts the WML expression to an
expression of DBL. This translation step maps between
the logical structure most natural in describing
an application domain and the actual structure of
database files. Finally, the answer to a database
query in DBL is expressed in a formula of CVL.

THE  LOGICAL  LANGUAGES 

Each of the logical languages just mentioned (EFL,
WML, DBL and CVL) is derived from a common logical
language, from which each differs only in the particular
constant symbols which are allowed and not in
the operators (the only exception is CVL, whose
operators are a subset of the operators of this common
language).

This logical language has three main properties.
First, it is higher-order, which means that it can quantify
not only over individuals, but over sets, tuples and
functions as well. Second, it is intensional, which
means that the denotations of its expressions are assigned
relative to external indices of world and time,
and it incorporates an "intension" operator which
denotes the "sense" (with respect to these indices) of
an expression. Third, the language has a rich system
of types which are used to represent semantic selectional
restrictions and so serve to delimit the set of
meaningful expressions in the language.


STRUCTURAL  SEMANTICS 

The structural semantic component uses a set of
structural semantic rules, paired one for one with the
syntactic rules of the grammar. An example of these
rules is given below:^3

S → NP VP OPTSADJUNCT

(lambda (np vp oa)
  (oa (intension ((q np) vp))))

This is the top-level declarative clause rule given earlier,
with its corresponding semantic rule. Note that
there are three variables bound in the lambda (np,
vp, and oa), corresponding to the three terms on the
right-hand side of the syntactic rule (NP, VP, and
OPTSADJUNCT). During semantic interpretation, the
semantic translations of these right-hand terms are
substituted in for the variables np, vp and oa to make
the interpretation of the whole clause.

The effect of this rule is to construct a proposition
corresponding to the application of the predicate of
the clause (the VP) to the subject of the
clause (the NP). This proposition is modified by the
optional sentential adjunct, whose semantic translation
is applied to it. Examples of sentential adjuncts
are phrases such as "during April" and "in Africa", as
well as adverbs and more complicated modifiers.
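
The substitution performed by this semantic rule can be mimicked with
nested tuples standing for applications. The sketch is only
illustrative: the function names are hypothetical, and the bare symbols
passed in as "translations" are placeholders for what would in practice
be full EFL expressions.

```python
def s_rule(np, vp, oa):
    """Mirror of (lambda (np vp oa) (oa (intension ((q np) vp)))).

    Each application (f x) is represented as the tuple (f, x).
    """
    return (oa, ("intension", (("q", np), vp)))


def show(expr):
    """Print a nested-tuple expression in the paper's bracketed notation."""
    if isinstance(expr, str):
        return expr
    return "(" + " ".join(show(part) for part in expr) + ")"


# Placeholder translations of the three right-hand side constituents.
clause = s_rule(np="ships", vp="sail", oa="during-april")
print(show(clause))
```

The printed form has the same shape as the body of the rule, with the
three constituent translations in place of np, vp, and oa.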

LEXICAL  SEMANTICS 

The  lexical  semantic  component  is  concerned  with  the 
specific  meanings  of  a  word  in  a  subject  domain,  as 
opposed  to  the  manner  in  which  these  meanings  are 
combined.  These  specific  meanings  are  represented 
by  expressions  of  WML  which  are  associated  by  rules 
with  the  constant  symbols  of  EFL. 

A recursive-descent translation algorithm returns for
each node of an EFL expression a set of WML translations.
The translations for a constant expression are
just those dictated by its WML translation rule. The
translations for a complex expression are derived by
combining, in a cartesian product fashion, the translations
of the parts of the expression. At each level,
the set of possible translations is filtered to remove
anomalous translations: those which involve combinations
of WML expressions with incompatible
semantic types.

Sentences which are semantically ambiguous (such
as "They went to the bank") will simply return multiple
WML translations. Sentences which have no
non-anomalous WML translation (and are therefore considered
meaningless by the system) will return the
empty set as the result of WML translation.
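
The combine-then-filter scheme can be sketched for a toy fragment in
which every complex expression is a function applied to one argument.
All names here are hypothetical: the translation table, the WML
constants, and the two-level type system (atomic types as strings,
function types as (argument type, result type) pairs) are assumptions
made for the example, far simpler than the type system described above.

```python
from itertools import product


def translate(efl, rules):
    """Return all type-correct (wml, type) translations of an EFL term.

    efl:   a constant (str) or an application (function, argument)
    rules: EFL constant -> list of (wml_constant, type)
    """
    if isinstance(efl, str):
        return list(rules.get(efl, []))
    fn, arg = efl
    results = []
    pairs = product(translate(fn, rules), translate(arg, rules))
    for (f_wml, f_type), (a_wml, a_type) in pairs:
        # Keep the combination only if the function accepts this type.
        if isinstance(f_type, tuple) and f_type[0] == a_type:
            results.append(((f_wml, a_wml), f_type[1]))
    return results


RULES = {
    "bank": [("river-bank", "landform"), ("savings-bank", "institution")],
    "river": [("river", "landform")],
    "visited": [("visit", ("landform", "event")),
                ("visit", ("institution", "event"))],
    "audited": [("audit", ("institution", "event"))],
}

ambiguous = translate(("visited", "bank"), RULES)
disambiguated = translate(("audited", "bank"), RULES)
anomalous = translate(("audited", "river"), RULES)
print(len(ambiguous), disambiguated, anomalous)
```

The three calls show the three outcomes discussed above: several
translations for an ambiguous input, a single one when the types
disambiguate, and the empty list for an anomalous combination.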


^3 Feature specifications are omitted here for conciseness.

SEMANTIC COVERAGE

Currently,  the  semantics  is  able  to  map  75%  of  the 
training  sentences  and  52%  of  the  test  sentences  to  a 
WML  expression.  The  corresponding  figures  for  CVL 
are  44%  for  training  and  32%  for  test. 

THE  LATTICE  PARSER 

The basic approach we have taken for speech understanding
is to extend the text parser to deal with
spoken input. So instead of operating on typed input,
where a single word appears unambiguously at each
position, the parser must now deal with speech input,
which is highly ambiguous: a set of words is possible
at every position with varying acoustic likelihood
scores. While the data structure which is the input to
the text parser is relatively simple (a list of
words), the input to the lattice parser is a lattice of all
the possible words which have been found. Associated
with each word in the lattice (and therefore
each grammatical constituent) is a set of acoustic
match scores, with each score corresponding to
particular starting and ending times.

Parsing now consists of building larger grammatical
constituents from smaller ones and also keeping track
of the resulting acoustic scores as the new constituents
span longer intervals. The parser builds
these larger constituents word synchronously in a way
similar to the text parser. Two parsing algorithms
have been implemented for the lattice parser: the first
is a relatively straightforward modification of the text
parser's algorithm to deal with the input word lattice.
The second is similar, but supplements the text
parser's algorithm with a top-down predictive filtering
component. This significantly reduces the computational
complexity of the lattice parsing algorithm; see
Chow & Roukos (1989) for full details.
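
The bookkeeping of acoustic scores when two adjacent constituents are
joined can be sketched as follows. The representation is an assumption
made for illustration: each constituent carries a table from (start,
end) times to a log score, two entries combine only when the first ends
where the second begins, and only the best score per resulting interval
is kept. The numbers are invented.

```python
def combine(left, right):
    """Join score tables of two adjacent constituents.

    left, right: dict mapping (start, end) -> acoustic log score.
    Returns the score table of the combined constituent.
    """
    combined = {}
    for (s1, e1), score1 in left.items():
        for (s2, e2), score2 in right.items():
            if e1 != s2:          # the two pieces must be contiguous
                continue
            span = (s1, e2)
            total = score1 + score2
            if total > combined.get(span, float("-inf")):
                combined[span] = total
    return combined


# Hypothetical score tables for a determiner and a noun.
the_scores = {(0, 10): -12.0, (0, 12): -13.5}
ships_scores = {(10, 50): -35.0, (12, 50): -33.0, (12, 48): -34.0}

np_scores = combine(the_scores, ships_scores)
print(np_scores)
```

The resulting table plays the same role for the larger constituent as
the per-word score sets play for the words in the lattice.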

Currently, in the integrated speech understanding system,
just as in the text processing system, the semantics
component is applied after all parsing is done and
is not interleaved with the parsing. While there are
possible disadvantages to this approach (more computation
may be done since semantic knowledge is
not applied until late in processing), we have chosen
this method of integration as our first attempt since
the integration is simple and clean.

INTEGRATED  SYSTEM  PERFORMANCE 

In this section, we present results for HARC on the
standard DARPA 1000-Word Resource Management
speech database (Price, et al. (1988)), with 600 sentences
(about 30 minutes) of training speech to train
the acoustic models for each speaker. For these experiments,
speech was sampled at 20 kHz, and 14
Mel-Frequency cepstral coefficients (MFCC), their
derivatives (DMFCC), plus power (R0) and derivative
of power (DR0) were computed every 10 ms, using
a 20 ms analysis window. Three separate 8-bit
codebooks were created, one for each of the three sets of
parameters, using K-means vector quantization (VQ).
The experiments were conducted using the multi-codebook
paradigm in the HMM models, where the
output of the vector quantizer, which consists of a vector
of 3 VQ codewords per 10 ms frame, is used as the
input observation sequence to the HMM.

For the purpose of making computation tractable, we
applied the lattice pruning techniques described
above to a full word lattice to reduce the average
lattice size from over 2000 word theories to about 60.^4
At this lattice size, the probability of having the correct
word sequence in the lattice is about 98%, which
places an upper bound on subsequent system performance
using the language models.

Figure 1: Recognition Performance of HARC

Figure 1 shows the results averaged across 7
speakers, using a total of 109 utterances, under 4
grammar conditions. As shown, the grammars tested
include: 1) no grammar: all word sequences are
possible; 2) the word pair grammar, containing all
pairs of words occurring in the set of sentences that
was used to define the database; 3) the syntactic
grammar alone; and 4) semantic interpretation for a
posteriori filtering on the output of lattice parsing.

Note that the performance using the syntactic language
model is 7.5% error. At a perplexity of 700, one would
expect its performance to be closer to the no grammar
case, which has a perplexity of 1000 and an error rate
of about 15%. We hypothesize that perplexity alone
is not adequate to predict the quality of a language
model. In order to be more precise, one needs to look
at acoustic perplexity, a measure of how well a language
model can selectively and appropriately limit
acoustic confusability. A linguistically motivated language
model seems to do just that, at least in this
limited experiment. Also, surprisingly, using semantics
gave insignificant improvement in the overall performance.
One possible explanation for this is that
semantics gets to filter only a small number of the
sentences accepted by syntax. Out of the sentences
which receive semantic interpretations, syntax alone
determined the correct sentence better than 60 percent
of the time, leaving only about 20 sentences in
which the semantics has a chance to correct the error.
Unfortunately, of these errorful answers, most were
semantically meaningful, although there were some
exceptions. Pragmatic information may be a higher
level knowledge source to constrain the possible word
sequences, and therefore improve performance.

^4 60 word theories corresponds to about 4000 acoustic scores.
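
The no-grammar perplexity of 1000 quoted in the discussion of Figure 1
follows from the usual definition of perplexity if every word of the
1000-word vocabulary is assumed equally likely at each position. The
short sketch below computes perplexity from per-word model probabilities;
the function name and the test sequences are hypothetical, and the
second, skewed sequence is invented purely for contrast.

```python
import math


def perplexity(word_probs):
    """Perplexity of a model over a word sequence.

    word_probs: the probability the model assigned to each word in turn.
    Computed as 2 ** (-(1/N) * sum(log2 p)).
    """
    log_sum = sum(math.log2(p) for p in word_probs)
    return 2 ** (-log_sum / len(word_probs))


# Uniform model over a 1000-word vocabulary (the "no grammar" case).
uniform = perplexity([1 / 1000] * 8)
# A made-up model that is confident on half the words, uncertain on the rest.
skewed = perplexity([0.5, 0.001, 0.5, 0.001])
print(round(uniform, 3), round(skewed, 3))
```

Perplexity averages over how many words are allowed, not over how
acoustically confusable those words are, which is the limitation the
discussion above points to.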